A Statistical Approach to Performance Monitoring 
in Soft Real-Time Distributed Systems 






Danny Bickson, Gidon Gershinsky, Ezra N. Hoch and Konstantin Shagin 

IBM Haifa Research Lab, 
Mount Carmel, Haifa 31905, Israel, 
{dannybi,gidon,ezrah,konst}@il.ibm.com






Abstract 

Soft real-time applications require timely delivery of 
messages conforming to the soft real-time constraints. Sat- 
isfying such requirements is a complex task both due to 
the volatile nature of distributed environments, as well as 
due to numerous domain-specific factors that affect mes- 
sage latency. Prompt detection of the root-cause of exces- 
sive message delay allows a distributed system to react ac- 
cordingly. This may significantly improve compliance with 
the required timeliness constraints. 

In this work, we present a novel approach for distributed
performance monitoring of soft real-time distributed systems.
We propose to employ recent distributed algorithms from the
statistical signal processing and learning domains, and to
utilize them in the different context of online performance
monitoring and root-cause analysis, for pinpointing the reasons
for violation of performance requirements. Our approach is
general and can be used for monitoring any distributed system;
it is not limited to the soft real-time domain.

We have implemented the proposed framework in Trans- 
Fab, an IBM prototype of soft real-time messaging fabric. 
In addition to root-cause analysis, the framework includes 
facilities to resolve resource allocation problems, such as 
memory and bandwidth deficiency. The experiments demon- 
strate that the system can identify and resolve latency prob- 
lems in a timely fashion. 



1. Introduction 

The number of distributed systems with latency requirements
is growing rapidly. In several domains, such as military,
industrial automation, and financial markets, message latency
plays a critical role. Since it is technically hard and in
most cases costly to guarantee that each and every message
is delivered within a predefined period of time (hard
real-time), many applications impose weaker requirements by
allowing a small portion of messages to exceed their deadline
(soft real-time). Still, as a consequence of application
complexity and the volatile nature of the distributed envi- 
ronment, even compliance with these weaker constraints is 
a challenging task. Unexpected activity bursts, message loss 
due to unreliable communication medium, network buffer 
overflow, network congestion, resource sharing and many 
other unpredictable factors may result in significant increase 
in the end-to-end message delay. 

It is highly desirable that a distributed system adapts to 
the changing conditions and thus avoids violations of the 
latency constraints. A crucial step towards achieving this 
is enabling the system to identify the root-cause whenever 
there is a degradation in performance. This, by itself, is a 
non-trivial problem, because in this context a symptom may 
be easily mistaken for the real cause or misinterpreted. For 
example, packet drop resulting from buffer space deficiency 
on the receiver side may be attributed to packet loss due to 
network congestion. 

Distributed systems with latency constraints often em- 
ploy resource reservation to ensure that the more critical 
components are served more promptly. The reserved re- 
sources are commonly the memory space, bandwidth and 
CPU share. In many cases, readjustment of the resource 
quotas can alleviate the timeliness issues. For instance, if a
certain component rapidly generates messages, it may exceed
its transmission bandwidth limit and hence may have to queue
messages rather than transmitting them immediately.
Consequently, the delayed messages may miss their delivery
deadline. This may be avoided by temporarily increasing the
component's bandwidth, if possible.

The ideas above lead us to devise a framework that mon- 
itors distributed system performance, determines the root- 
cause of the increased delay, and takes corrective actions 
in order to avoid violation of the timeliness constraints. 
We propose a monitoring framework which employs a distributed
root-cause analysis. A significant advantage of the statistical
approach is that, in contrast to expert knowledge methods, it
is independent of system characteristics such as the operating
system, transport protocol and network structure. Moreover, it
requires minimal domain-specific knowledge to accurately
determine the root-cause.
Our primary design goals were the following:

• System operation should be distributed, without a cen- 
tralized computing node. 

• The system should adapt to network changes as 
quickly as possible. 

• The system should not rely on software implemen- 
tation, OS and networking details ("black-box" ap- 
proach). 

In the current work, we make the novel contribution of 
borrowing recent algorithms from the field of statistical signal
processing HI [9] to be employed in a different context
of a distributed monitoring framework. By utilizing those 
algorithms we are able to efficiently and distributively char- 
acterize the behavior of the varying network conditions as 
a stochastic process, and to perform root-cause analysis for 
detecting the parameters which cause an increased latency. 

The framework works as follows. Each node monitors a 
large number of various local operating system and appli- 
cation parameters. If a degraded performance is observed 
anywhere in the network, the nodes jointly characterize the 
performance by regarding it as a linear stochastic process, 
using statistical signal processing tools. Subsequently, a 
joint root-cause analysis computation is performed to iden- 
tify the parameters which affect performance. Once the rea- 
sons for degradation are known, a corrective action is taken 
(whenever possible), by adjusting the resource quota of one 
or more nodes. 

The root-cause analysis technique is general and can be 
applied in many other distributed systems, and it is not lim- 
ited to the soft real-time domain. The main performance 
measures are tunable and can be set, for example, to CPU 
consumption, bandwidth utilization etc. One of the appeal- 
ing properties of our monitoring is that it can be used for 
debugging as well - detecting anomalous software behav- 
ior like bugs and deadlocks. It can be further used for load 
balancing, minimization of deployed resource, hot-spot de- 
tection etc. 

We have implemented the proposed framework in TransFab, a
prototype of a soft real-time messaging transport fabric
developed at the IBM Research Lab. We have tested our
framework in various settings and on different topologies.
The experiments show that the proposed scheme accurately
identifies the reasons for performance degradation in
non-trivial scenarios. Overall, the protocol is lightweight.
The message overhead of a single root-cause analysis
computation amounts to only several kilobytes per
communicating node, which is negligible in most contemporary
networks. We have further observed only a minor increase
in CPU and memory consumption.

Our technique can scale up to large domains, in a hierar- 
chical manner, where each sub domain performs monitoring 
locally, filters out the relevant parameters which affect per- 
formance, and then the algorithm is run again between the 
different domains. 

This paper is organized as follows. Section 2 describes
related previous work. Section 3 outlines the mathematical
background required for understanding our construction.
Section 4 presents our framework. Experimental results of a
real LAN deployment are discussed in Section 5.
We conclude in Section 6.

2. Related work 

Recently, there has been a lot of research targeting the
monitoring and allocation of resources in communication
networks, utilizing techniques from the statistics, learning and
data mining domains (see for example ifTOl [2] E] [8] [6] ID).
One possible approach for obtaining strict performance
guarantees for a distributed software application is to use
over-provisioning, where the required host and network
resources are allocated ahead of time [6], avoiding cases of
resource congestion. In contrast, in the current work we assume
a dynamic model, where software behavior and resource
requirements are not known in advance.

Other works use a centralized computation for computing
the best allocation of resources lITOl lSl, while we assume
a distributed computing model. In our model there is no
central server to which all the information is shipped and
where it is processed. Rish et al. Q optimize topology
construction to improve the download speed of a peer-to-peer
network. In the current work, we assume there is some given
topology of the communication flows between the participating
nodes, as designed by the application builder, and perform
the monitoring on top of the given topology. Our monitoring
framework can be applied in any given topology, including
graphs with cycles.

Resource Bundles H is an example of a successful approach
for finding and clustering similar available resources
over the WAN, and is closely related to this work. In the
ResourceBundles system, resources are captured daily to form
a resource utilization histogram. Available resources are
aggregated to provide users with a smart selection of groups
of resources. Not surprisingly, it is shown that adding
historical information about resource consumption
significantly improves the selection of resources. In the
current paper, our focus is on real-time resource allocation:
we capture resource usage in a time window of seconds (rather
than daily). Furthermore, we characterize the process as a
noisy linear stochastic process. This allows greater
flexibility in describing the process behavior over time, by
characterizing the joint covariance of the measured parameters
(relative to the histograms used by ResourceBundles).
Furthermore, our framework is also capable of performing
root-cause analysis of the parameters which affect system
performance.

3. Mathematical Background 

This section briefly reviews the mathematical background
needed for describing the algorithms deployed. Section 4
explains how those algorithms are used in the monitoring
context.

3.1. The Kalman filter algorithm 

The Kalman filter is an efficient iterative algorithm that
estimates the state of a discrete-time controlled process
$x \in \mathbb{R}^n$ that is governed by the linear stochastic
difference equation

$$x_k = A x_{k-1} + w_{k-1}, \tag{1}$$

with a measurement $z \in \mathbb{R}^m$ that is $z_k = H x_k + v_k$.
The random variables $w_k$ and $v_k$ represent the process and
measurement AWGN noise (respectively), with
$p(w) \sim \mathcal{N}(0, Q)$ and $p(v) \sim \mathcal{N}(0, R)$.
The discrete Kalman filter update equations are given by ifTTI:

The prediction step:

$$x_k^- = A x_{k-1}, \tag{2a}$$
$$P_k^- = A P_{k-1} A^T + Q. \tag{2b}$$

The measurement step:

$$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}, \tag{3a}$$
$$x_k = x_k^- + K_k (z_k - H x_k^-), \tag{3b}$$
$$P_k = (I - K_k H) P_k^-, \tag{3c}$$

where $I$ is the identity matrix.

The algorithm operates in rounds. In round $k$ the estimates
$K_k, x_k, P_k$ are computed, incorporating the (noisy)
measurement $z_k$ obtained in this round. The outputs of the
algorithm are the mean vector $x_k$ and the covariance matrix
$P_k$.
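
As an illustration only, one round of Eqs. (2a)-(3c) can be sketched in a few lines of NumPy. This is a centralized, single-node sketch, not the distributed GaBP-based computation used by the framework; the function name `kalman_step` and the toy scalar signal are assumptions made for the example.

```python
import numpy as np

def kalman_step(x_prev, P_prev, z, A, H, Q, R):
    """One round k of the discrete Kalman filter (Eqs. 2a-3c)."""
    # Prediction step (Eqs. 2a, 2b)
    x_pred = A @ x_prev
    P_pred = A @ P_prev @ A.T + Q
    # Measurement step (Eqs. 3a-3c)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_prev)) - K @ H) @ P_pred
    return x_new, P_new

# Toy usage (hypothetical data): smooth noisy readings of a constant value 5.0
rng = np.random.default_rng(0)
A = H = np.eye(1)
Q = R = 0.01 * np.eye(1)
x, P = np.zeros(1), np.eye(1)
for _ in range(50):
    z = 5.0 + rng.normal(0.0, 0.1, size=1)
    x, P = kalman_step(x, P, z, A, H, Q, R)
```

After the loop, `x` holds the smoothed mean and `P` the estimate covariance, which are the two outputs described above.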

3.2. Generalized least squares (GLS) 
method 

Given an observation matrix $A$ of size $n \times k$, and a
target vector $b$ of size $n \times 1$, linear regression
computes a vector $x$ which is the least squares solution to
the quadratic cost function

$$\min_x \|Ax - b\|_2^2 .$$

The algebraic solution is $x = (A^T A)^{-1} A^T b$. The vector
$x$ can be regarded as the hidden weight parameters which,
given the observation matrix $A$, explain the target vector $b$.



The linear regression method has an underlying assumption
that the measured parameters are not correlated. However,
as shown in the experimental results in Section 5, the
measured parameters are highly correlated. For example,
on a certain queue the numbers of get and put operations in
each given second are correlated. In this case, it is better
to use the generalized least squares (GLS) method. In this
method, we minimize the quadratic cost function

$$\min_x (Ax - b)^T P^{-1} (Ax - b), \tag{4}$$

where $P$ is the covariance matrix of the observed data. In
this case, the optimal solution is

$$x = (A^T P^{-1} A)^{-1} A^T P^{-1} b, \tag{5}$$

which is the best linear unbiased estimator (BLUE).
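
The two estimators can be compared with a short NumPy sketch. It is a centralized illustration under stated assumptions: `P` is taken to be an $n \times n$ symmetric positive-definite matrix, as the dimensions of Eq. (4) require, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def ols(A, b):
    """Ordinary least squares: x = (A^T A)^{-1} A^T b."""
    return np.linalg.solve(A.T @ A, A.T @ b)

def gls(A, b, P):
    """Generalized least squares (Eq. 5); P is assumed n x n and SPD."""
    Pinv_A = np.linalg.solve(P, A)   # P^{-1} A without forming the inverse
    Pinv_b = np.linalg.solve(P, b)   # P^{-1} b
    return np.linalg.solve(A.T @ Pinv_A, A.T @ Pinv_b)

# Synthetic, noise-free example: both estimators should recover x_true
rng = np.random.default_rng(1)
n, k = 20, 3
A = rng.normal(size=(n, k))
x_true = np.array([2.0, -1.0, 0.5])
b = A @ x_true
idx = np.arange(n)
P = 0.8 ** np.abs(idx[:, None] - idx[None, :])  # correlated covariance
x_ols = ols(A, b)
x_gls = gls(A, b, P)
```

With $P = I$ the GLS solution reduces to ordinary least squares, which gives a convenient sanity check.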

3.3. Efficient distributed computation via 
the Gaussian Belief Propagation algo- 
rithm 

Recent work IH shows how to compute the Kalman filter
efficiently and distributively over a communication network.

Other recent work G] [9] O shows that the GLS method
(Eq. 5) can also be computed efficiently and distributively
using the GaBP algorithm.

The Gaussian Belief Propagation (GaBP) algorithm is an
efficient iterative algorithm for solving a system of linear
equations of the type $Ax = b$ [9]. The input to the algorithm
is the matrix $A$ and the vector $b$; the output is the vector
$x = A^{-1} b$. The algorithm is distributed, which means that
each node gets a part of the matrix $A$ and the vector $b$ as
input, and outputs a part of the vector $x$. The algorithm may
not converge, but when it converges it is known to converge to
the correct solution.
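
To make the message-passing idea concrete, the following is a minimal single-process sketch of GaBP with synchronous updates, assuming a symmetric, diagonally dominant matrix $A$ (a standard sufficient condition for convergence). Each index $i$ plays the role of a node; the function name `gabp_solve` and the small test system are assumptions for illustration, and this is not the TransFab implementation.

```python
import numpy as np

def gabp_solve(A, b, max_iter=200, tol=1e-12):
    """Solve Ax = b with Gaussian Belief Propagation (A symmetric)."""
    n = len(b)
    Pm = np.zeros((n, n))  # Pm[i, j]: precision message sent from i to j
    hm = np.zeros((n, n))  # hm[i, j]: precision-times-mean message from i to j
    for _ in range(max_iter):
        Pm_new = np.zeros_like(Pm)
        hm_new = np.zeros_like(hm)
        for i in range(n):
            # Aggregate the prior of node i with all incoming messages
            P_i = A[i, i] + Pm[:, i].sum()
            h_i = b[i] + hm[:, i].sum()
            for j in range(n):
                if i == j or A[i, j] == 0.0:
                    continue
                # Exclude the message previously received from j
                P_excl = P_i - Pm[j, i]
                h_excl = h_i - hm[j, i]
                Pm_new[i, j] = -A[i, j] * A[j, i] / P_excl
                hm_new[i, j] = -A[j, i] * h_excl / P_excl
        delta = max(np.abs(Pm_new - Pm).max(), np.abs(hm_new - hm).max())
        Pm, hm = Pm_new, hm_new
        if delta < tol:
            break
    # Marginal means: each node computes its own entry of x
    return (b + hm.sum(axis=0)) / (np.diag(A) + Pm.sum(axis=0))

A = np.array([[4.0, 1.0, 0.5],
              [1.0, 5.0, 1.0],
              [0.5, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
x = gabp_solve(A, b)
```

Note that node $i$ only needs row $i$ of $A$, entry $b_i$ and the messages from its neighbors, which is what makes the computation distributable.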

Due to space limitations, we do not reproduce any of the
previous results here. The interested reader is referred
to HI [3] 111 for a complete description of the algorithms
deployed.

4. Our proposed construction

Our monitoring framework is composed of four stages,
as depicted in Figure 1. The first stage is the data collection
stage. In this stage the nodes locally monitor their measurable
performance parameters and record the relevant parameters in
a local data structure. The data collection is done in the
background every configured time frame $\Delta t$, and has
minimal effect on performance. During normal operation, all
messages arrive before their soft real-time deadlines, so
there is no need to compute the next stages. Whenever one of
the nodes detects some deterioration in performance (e.g., a
message is almost late), it notifies the other nodes that it
wishes to compute the second stage.

The second stage performs the Kalman filter computation
distributively. The input to the second stage is the local
parameter data collected in the first stage, and its output
is the mean and the joint covariance matrix, which characterize
the correlation between the different parameters (possibly
collected on different machines). The underlying algorithm
used for computing the Kalman filter updates is the GaBP
algorithm (described in Section 3.3). The output of the second
stage can also be used for reporting performance to the
application. For example, we are able to measure the mean
and variance of the effective bandwidth.

The third stage computes the GLS method (explained in
Section 3.2) for performing regression. The target for the
regression can be chosen on the fly. In our experiments we
were mainly interested in the total message latency as our
most important performance measure. The input to the third
stage is the parameter data collected in the first stage, and
the covariance matrix computed in the second stage. The
output of the third stage is a weight parameter vector. The
weight parameter vector has the intuitive meaning of providing
a linear model for the data collected. The computed linear
model allows us to identify which parameters influence
performance the most (the parameters with the highest absolute
weights). An additional benefit is that, using the computed
weights, we are able to compute predictions of the node
behavior. For example, we can predict how an increase of 10MB
of buffer memory will affect the total latency experienced.

Finally, the fourth stage uses the output of the third stage
for taking corrective measures. For example, if the main
reason for increased latency is related to insufficient
memory, the relevant node may request additional memory
resources from the operating system. The fourth stage is done
locally and is optional, depending on the type of application
and the availability of resources.

Figure 1. Schematic operation of the proposed monitoring
framework.

Below, we give further details regarding the implementation
and computational aspects of the different stages.

4.1. Stage I: local data collection

In this stage, participating nodes locally monitor their
performance every $\Delta t$ seconds. Nodes record performance
parameters, such as memory and CPU consumption, bandwidth
utilization and other relevant parameters. Depending on
the monitored software, information about internal data
structures like files, sockets, threads, available buffers etc.
is also monitored. The monitored parameters are stored
locally, in an internal data structure representing the matrix
$A$ of size $n \times k$, where $n$ is the history size and $k$
is the number of measured parameters. Note that at this stage
the monitoring framework is oblivious to the meaning of the
monitored parameters, regarding all monitored parameters
equally as noisy linear stochastic processes.


4.2. Stage II: Kalman filter 

The second stage is performed distributively over the
network, where participating nodes compute the Kalman
filter algorithm (outlined in Section 3.1). The input to the
computation is the matrix $A$ recorded in the data collection
stage, and the assumed levels of noise $Q$ and $R$. The
output of this computation is the mean vector $x$ and the joint
covariance matrix $P$ (Eqs. 3b, 3c). The joint covariance
matrix characterizes the correlation between measured
parameters, possibly spanning different nodes.

We utilize recent results from the field of statistical signal
processing for computing the Kalman filter using the GaBP
iterative algorithm (explained in Section 3.3).
The benefit of using this recent efficient distributed
iterative algorithm is faster convergence (a reduced number of
iterations) relative to classical linear algebra iterative
methods. This, in turn, allows the monitoring framework to
adapt promptly to changes in the network.

The output of the Kalman filter algorithm, $x$, is computed
in each node locally. Each computing node holds the part of
the output which is the mean value of its own parameters.
To reduce computation cost, we do not compute the full
matrix $P$, but only the rows of $P$ which represent
significant performance parameters selected in advance.

4.3. Stage III: GLS Regression 

The third stage is performed distributively over the network
as well, for computing the GLS regression (Eq. 5).
The input to this stage is the joint covariance matrix $P$
computed in the second stage, the recorded parameters matrix
$A$, and the performance target $b$. The output of the GLS
computation is a weight vector $x$ which assigns weights to
all of the measured parameters. By selecting the parameters
with the highest absolute magnitude in the vector $x$,
we identify which of the recorded parameters significantly
influence the performance target. The result of this
computation is obtained locally, which means that each node
computes the weights of its own parameters. Additionally, the
nodes distributively compute the ten largest values.
The GLS method is again computed using the GaBP algorithm
(Section 3.3). The main benefit of using the GaBP
algorithm for both tasks (Kalman filter and GLS method
computation) is that we need to implement and test the
algorithm only once.
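
The ranking step can be sketched as follows. This is a local, non-distributed illustration; the parameter names and weights are hypothetical and do not come from the experiments.

```python
def top_parameters(names, weights, top=10):
    """Return the `top` (name, weight) pairs with the largest |weight|."""
    ranked = sorted(zip(names, weights), key=lambda nw: abs(nw[1]), reverse=True)
    return ranked[:top]

# Hypothetical parameter names and regression weights
names = ["tx_buffer_free", "rx_cpu", "tx_queue_len", "rx_mem"]
weights = [-3.2, 0.4, 2.5, -0.1]
print(top_parameters(names, weights, top=2))
# prints [('tx_buffer_free', -3.2), ('tx_queue_len', 2.5)]
```

Ranking by absolute value matters because a strongly negative weight is as informative about the target as a strongly positive one.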

4.4. Stage IV: taking corrective measures 

Whenever a node detects that a local parameter computed in
stage III is highly correlated with the target performance
measure, it may try to take corrective measures.
This step is optional and depends on the application and/or
the operating system support. Examples of local system
resources are CPU quota, thread priority, memory allocation
and bandwidth allocation. Note that resources may be either
increased or decreased based on the regression results.

For implementing this stage, a mapping between the
measured parameters and the relevant resource needs to be
defined by the system designer. For example,
TRANSMITTER_PROCESS_VSIZE, the process virtual memory size, is
related to the memory allocated to the process by the
operating system. Our monitoring framework (stages I - III) is
not aware of the semantic meaning of this parameter. For taking
corrective measures, the mapping between parameters and
resources is essential and requires domain-specific knowledge.
Returning to the virtual memory example above,
the mapping links TRANSMITTER_PROCESS_VSIZE to the
memory quota of the transmitter process. Whenever this
parameter is selected by the linear regression done in stage
III as a parameter which significantly affects performance, a
request is made to the operating system to increase the
virtual memory quota.

A natural question is by how much to increase or decrease
a certain resource quota. Here, the results of Stage III are
useful. The regression assigns weights to the examined system
parameters to explain the performance target in the linear
model. More formally, $Ax \approx b$, where $x$ is the weight
vector, $A$ holds the recorded parameters and $b$ is the
performance target. Now, assume $x_i$ is the most significant
parameter selected by the regression, representing resource
$i$. It is possible to increase $x_i$ by, say, 20%,
$x_i \leftarrow x_i + 0.2 x_i$, and examine the effect of the
increase on the predicted performance by using the equation
$b = Ax$.

Specifically, in the soft real-time systems we focus on,
we can examine the effect of an increase of 10% in transmitter
memory by computing the predicted effect on total
message latency. In the current work we mainly experimented
with memory predictions, where the increase was limited to at
most 10%. We have found the linear model quite
accurate under those settings, whenever the memory was
the actual performance bottleneck. An area for future work
is to investigate the applicability of predictions computed
by the linear model in broader settings and for other
resources.
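
The what-if computation described above can be sketched as below. The numbers are hypothetical and chosen only so the arithmetic is easy to follow; `predict_what_if` is an illustrative name, not part of the framework.

```python
import numpy as np

def predict_what_if(A, x, i, change=0.10):
    """Predicted target b = Ax after scaling the weight of resource i."""
    x_new = np.array(x, dtype=float)
    x_new[i] *= 1.0 + change          # x_i <- x_i + change * x_i
    return A @ x_new

# Hypothetical history (2 samples, 2 parameters) and regression weights
A = np.array([[100.0, 2.0],
              [120.0, 3.0]])
x = np.array([-0.05, 4.0])

baseline = A @ x                                   # [3.0, 6.0]
what_if = predict_what_if(A, x, i=0, change=0.10)  # [2.5, 5.4]
```

Because the model is linear, scaling $x_i$ by a factor gives the same prediction as scaling column $i$ of $A$ by that factor, so the sketch can equally be read as "what if resource $i$ were 10% larger".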

5. Experimental Results 

The TransFab messaging fabric is a high-performance 
soft real-time messaging middleware. It runs on top of 
the networking infrastructure and implements a set of pub- 
lish/subscribe and point-to-point services with explicitly en- 
forced limits on times for end-to-end data delivery. 

We have incorporated our monitoring framework as a
part of the TransFab overlay in Java. In our experiments, the
TransFab node recorded 190 parameters which characterize
the current performance. Among them are memory usage,
process information (obtained from the /proc file system in
Linux), current bandwidth, number of incoming/outgoing
messages, state of internal data structures like queues and
buffer pools, number of get/put operations on them, etc.

We have utilized the unreliable UDP transport, whose
timeliness properties are more predictable than those of
TCP. TransFab incorporates reliability mechanisms that
guarantee in-order delivery of messages. A transmitter
discards a message only after all receivers have acknowledged
the receipt of the message. When a receiver detects a missing
message, it requests its retransmission by sending a negative
acknowledgement to the transmitter.

5.1. Two nodes experiment 

For testing our distributed monitoring framework, we
have performed the following small experiment. In this
experiment, our main performance measure is the total packet
latency. A transmitter node and a receiver node run on
two idle Pentium IV dual core AMD Opteron 2.6GHz Linux
machines on the same LAN. The transmitter was configured
to send 10,000 messages of size 8Kb per second. The memory
allocation of both nodes is 100Mb. The experiment ran
for 500 seconds, where the history size $n$ was set to 100
seconds. During the experiment, stage I (data collection) of
the monitoring was performed every $\Delta t = 1$ second. At
time 250 seconds, the Kalman filter algorithm was computed
distributively by the nodes. The goal of this experiment is
to show that by performing the Kalman filter computation
(stage II) using information collected from two nodes, we
are able to identify which of the collected parameters
influence the total packet latency. Furthermore, we are able to
gain insights about system performance which could not be
computed by using only local information.

To save bandwidth, nodes locally filter constant
parameters out of the matrix $A$. Thus, the input to the
Kalman filter algorithm was reduced to 45 significant
parameters. Figure 2 presents a joint covariance matrix
calculated by the Kalman filter algorithm (computed in the
second stage) using this typical run. Column (and row) 40
represents the total packet latency measured by the receiver.
The covariance matrix includes parameters captured by the
transmitter (columns 1-23) and parameters recorded by the
receiver (columns 24-45).
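
A minimal sketch of the local history window and the constant-parameter filter is given below, assuming samples arrive as name-to-value dictionaries once per $\Delta t$. The class name `HistoryWindow` and the parameter names are hypothetical; the actual framework is implemented in Java inside TransFab.

```python
from collections import deque

class HistoryWindow:
    """Keeps the last n samples of k named parameters (the matrix A)."""

    def __init__(self, names, n):
        self.names = list(names)
        self.rows = deque(maxlen=n)   # old samples fall out automatically

    def record(self, sample):
        """Append one sample, given as a dict mapping name -> value."""
        self.rows.append([float(sample[name]) for name in self.names])

    def non_constant(self):
        """Return (names, matrix) with constant columns filtered out."""
        cols = list(zip(*self.rows))
        keep = [j for j, c in enumerate(cols) if max(c) != min(c)]
        kept_names = [self.names[j] for j in keep]
        matrix = [[row[j] for j in keep] for row in self.rows]
        return kept_names, matrix

# Hypothetical parameters sampled once per second, history size n = 3
window = HistoryWindow(["mem_used", "open_sockets", "pkts_per_sec"], n=3)
for t in range(5):
    window.record({"mem_used": 100 + t,
                   "open_sockets": 4,            # constant, filtered out
                   "pkts_per_sec": 3000 - 10 * t})
kept, A = window.non_constant()
```

Here `kept` contains only the two varying parameters and `A` holds the three most recent samples, so a constant column never needs to be shipped over the network.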




Figure 2. Joint covariance matrix computed
distributively using two TransFab nodes (axes: parameter
number). Column (and row) 40 captures the dependency
of total packet latency on the various measured
parameters. Warm colors (yellow to red) represent
medium to high correlation of measured parameters
with the total latency.

As clearly seen in Figure 2, the total latency of packets,
even in a small setup of only two nodes on two idle
machines, is strongly correlated with dozens of parameters.
Furthermore, the total latency depends on parameters from 
both the sender and the receiver. The covariance matrix 
plots the dependence of pairs of parameters, where we are 
mainly interested in understanding the reasons for message 
delay. 

In practice, we have deployed the following optimiza- 
tion: we do not compute the full covariance matrix, but only 
the rows which represent correlation with the target param- 
eters (in this example only row 40). 

Figure 3 depicts the additional Kalman filter output,
which is the mean vector $x_k$. In this example, the mean
is shown for the packets per second parameter of the same
two-node experiment. The assumed error levels $Q, R$ define
the level of smoothing. In our experiments we took $Q, R$ to
be diagonal matrices with error level $\sigma^2 = 0.01$. The
mean value and computed variance provide the nodes with
additional information about performance, which could be used
for monitoring and debugging.

Figure 3. Kalman filter smoothing of the packets
per second parameter.

We have repeated the previous experiment, but this time,
at time 150 seconds, the transmitting machine memory was
reduced to 2.4Mb for outgoing message buffers. At time
155, the receiving node locally detected a degradation in
performance, and the nodes computed stages II (Kalman filter)
and III (linear regression), where the history parameter was
set to $n = 100$ seconds, for finding the parameters which
affect the total packet latency. Figure 4 presents the output
of the distributed linear regression performed by the two
TransFab nodes. Clearly, the transmitter's low buffers are the
main reasons (3 out of the top 5) behind the increased latency.
We deployed the GLS method (shown in Figure 5) to improve
the quality of the ranking. This time, seven out of the top
nine parameters are related to transmitter buffers.

5.2. Larger experiments 

We have performed the following experiment, which
demonstrates the applicability of our monitoring framework.
The goal of this experiment was to test, given a randomly
chosen faulty node, whether the monitoring framework is able
to correctly identify the faulty node and the type of the
fault.

At runtime, we randomly created an overlay tree
topology of ten nodes. A single transmitter (the root node)
transmits a flow of 3,000 packets of size 8K per second to
its children, with a rate of 24Mbit for each flow. The nodes
have no global knowledge of the network topology.
Specifically, the transmitter is not aware which of its direct
children are forwarding nodes and which are tree leaves. Next,
we selected at random one of the faults listed in Table 1 to
be assigned to one of the tree nodes.

Figure 4. Linear regression results of a transmitter
with low memory. Three of the top five reasons for
latency increase are related to the transmitter
memory buffers.

Error   Description
A       No error
B       Low CPU receiver
C       Low CPU transmitter
D       Channel loss
E       Low memory receiver
F       Low memory transmitter

Table 1. Possible host and network faults, selected
randomly at runtime.

We have run the experiment for 500 seconds, after which 
the nodes jointly compute the regression results, where the 
target was the transmitter latency. (Transmitter latency is 
defined as the total time messages wait in transmitter buffers 
from their submission by the application until they are ac- 
tually transmitted over the wire.) 



Figure 6. A tree topology of ten machines, rooted at the
transmitter (T). The topology, forwarding node and type of
error (in this case low memory receiver) are
randomly selected at runtime. Regression is
performed jointly on all ten nodes, where the
target is the transmitter latency.

Figure 5. Improved regression results for a
transmitter with low memory, using the GLS
method. Seven out of the top nine reasons for
latency increase are related to the transmitter
memory buffers.



This experiment was repeated multiple times, each time
with a different topology, a different faulty node and a
different fault. An example topology generated at
random is shown in Figure 6. The faulty node was assigned
fault E, a low memory receiver. The matching regression
results are presented in Figure 7. The results indicate that
the memory of the low memory node is identified as the
first cause which affects transmitter latency. The forwarding
node's memory is identified as the second and third causes
of the transmitter latency. Finally, the transmitter memory
is identified as the fourth cause of transmitter latency.
Besides detecting the faulty node, the critical path between
the transmitter and the low memory receiver is correctly
identified as the congested path.

Figure 7. Regression results for the tree
topology with ten nodes. The memory of the low memory
receiver (R7) is detected to be the
first cause of transmitter latency. The memory of the
forwarding node (R4) is the second
cause of transmitter latency. Finally, the
transmitter (T) memory is detected as the fourth
cause which affects transmitter latency.

Figure 8 depicts the quality of the linear regression for 
the same experiment, comparing the actual transmitter 
latency with the latency predicted by the linear model. More 
formally, we first compute x using Eq. 5 and then plot Ax 
vs. b. The desired result is that the actual latency and the 
latency predicted by the linear model are as close as 
possible to each other. However, with real-world data it is 
hard to get perfect predictions, probably because the linear 
model is a simplification of the real world. The figure 
shows that the overall fit between the predicted and actual 
latency is quite good, except for some spikes in the actual 
transmitter latency which were not predicted. Note that 
prediction quality is closely related to the discussion in 
Section 4.4 regarding resource allocation in stage IV. 
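
The comparison of Ax against b can be illustrated with a 
small sketch. It assumes that Eq. 5 is the ordinary 
least-squares solution of Ax = b, and it replaces the 
distributed GaBP computation with a direct solve of the 
normal equations. The function names and the sample data 
are hypothetical and serve only as an illustration. 

```python
# Sketch only: a direct least-squares fit, standing in for the
# distributed GaBP computation described in the paper.

def solve_linear(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting.
    Assumes M is square and nonsingular."""
    n = len(M)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def least_squares(A, b):
    """Return x minimizing ||Ax - b||^2 via A^T A x = A^T b."""
    cols = len(A[0])
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(cols)]
           for i in range(cols)]
    Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(cols)]
    return solve_linear(AtA, Atb)

def predict(A, x):
    """Predicted latency Ax, one value per sample (row of A)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmse(predicted, actual):
    """Root mean square error between predicted and actual latency."""
    n = len(actual)
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n) ** 0.5

# Hypothetical samples: each row holds two monitored parameters,
# b holds the measured transmitter latency for the same time slot.
A = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
b = [3.1, 4.4, 8.0, 9.6]
x = least_squares(A, b)
fit_error = rmse(predict(A, x), b)
```

A small fit_error corresponds to the two curves of Figure 8 
lying close together; unpredicted spikes in the actual 
latency increase it. 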

We repeated the ten-node experiment multiple times, each 
time selecting a different topology at random and a 
different fault. Table 2 summarizes our findings. In all 
cases we were able to correctly identify the faulty node, 
except for the channel loss fault D. The channel loss case 
simulates a lossy channel by dropping packets uniformly at 
random with a probability of 5%. In the case of channel 
loss, the identified node was the receiver. 



Figure 8. Regression quality for the ten-node 
experiment. The blue line represents the actual 
transmitter latency, while the green line 
represents the latency predicted by the linear 
model. 

An additional question we ask is whether the linear 
regression is able to identify the type of fault. The 
easiest faults to detect were the low memory faults (E and 
F), where either the transmitter or the receiver had low 
memory. In those cases both the node and the reason for the 
fault were identified correctly. In the case of normal 
behavior (fault A), we identified receiver latency as having 
the highest correlation with transmitter latency, which is 
normal. 

The loss situation (fault D) was much harder to distinguish 
from a low CPU receiver (fault B). The reason is that it is 
much harder to enforce flow control using IP multicast, 
since different nodes have different capabilities. In the 
case of fault B, the slow receiver is swamped with packets 
at a rate higher than its processing capability, so packet 
loss is incurred at the level of the operating system socket 
buffers. On the other hand, in fault D, random packets are 
dropped, which results in a situation similar to fault B. 
Table 2 summarizes the groups of faults that were not 
distinguishable using the linear regression. 

Since we were not able to distinguish between channel loss 
and socket buffer overflow, we experimented with a bursty 
loss model (instead of random loss). Under the bursty loss 
model, we dropped a sequence of 100 packets, with a 
probability of 1% chosen uniformly at random. In this case 
it was much easier to distinguish between faults B and D by 
comparing the pattern of negative acknowledgements. 



Error                    Faulty node   Reason for fault         Reason for fault identified 
                         identified?   identified?              using domain knowledge? 
A (no error) 
B (receiver CPU)         ✓             B,D indistinguishable    ✓ 
C (transmitter CPU)      ✓             C,E indistinguishable    ✓ 
D (channel loss)         receiver      B,D indistinguishable    ✓ 
E (receiver memory)      ✓             ✓ 
F (transmitter memory)   ✓             ✓ 

Table 2. Summary of results for the randomly constructed ten-node overlays. 
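
The two loss models discussed above can be sketched as 
follows. This is an illustration under stated assumptions, 
not the code used in the experiments: the function names are 
hypothetical, and the bursty model is interpreted as starting 
a burst of 100 consecutive drops with probability 1% at each 
packet that is not already part of a burst. 

```python
import random

def uniform_loss(num_packets, drop_prob, rng):
    """Drop each packet independently with probability drop_prob.
    Returns a list of booleans, True where the packet is dropped."""
    return [rng.random() < drop_prob for _ in range(num_packets)]

def bursty_loss(num_packets, burst_prob, burst_len, rng):
    """Assumed bursty model: with probability burst_prob, drop the
    next burst_len consecutive packets; otherwise deliver the packet."""
    dropped = [False] * num_packets
    i = 0
    while i < num_packets:
        if rng.random() < burst_prob:
            end = min(i + burst_len, num_packets)
            for j in range(i, end):
                dropped[j] = True
            i = end
        else:
            i += 1
    return dropped

def loss_runs(dropped):
    """Lengths of maximal runs of consecutively dropped packets."""
    runs, current = [], 0
    for d in dropped:
        if d:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

rng = random.Random(0)  # fixed seed so the sketch is repeatable
uniform_runs = loss_runs(uniform_loss(10000, 0.05, rng))
bursty_runs = loss_runs(bursty_loss(10000, 0.01, 100, rng))
```

Under the uniform model most loss runs are one or two packets 
long, while under the bursty model every run spans at least 
a full burst (unless it is cut off at the end of the trace). 
This difference in run length is the kind of pattern that a 
receiver's negative acknowledgements would expose. 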

We conclude that linear regression is a very effective 
tool for identifying the faulty nodes as well as the 
critical congested path in a data dissemination overlay. 
However, in some cases it is harder to detect the reason for 
performance degradation without deploying additional tools. 
In those cases, we propose to use domain-specific knowledge 
based on the data collected locally at the faulty node. One 
option is to record the normal behavior of the node locally 
(computed in stage II), so that a faulty node can compare 
mean and variance to identify locally anomalous behavior. 
Another option is to add expert knowledge, for example 
detecting messages which arrive out of sequence with large 
gaps in order to identify bursty loss cases. 
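
Both options can be sketched in a few lines. The sketch 
below is a hypothetical illustration: the function names, 
the thresholds and the sample values are assumptions, not 
part of the framework described here. 

```python
from statistics import mean, pvariance

def is_anomalous(baseline, window, z_threshold=3.0, var_ratio_threshold=4.0):
    """Compare a recent window of samples with the recorded baseline.
    Flags the window if its mean is far from the baseline mean, or if
    its variance is much larger than the baseline variance.
    Both thresholds are illustrative assumptions."""
    mu, var = mean(baseline), pvariance(baseline)
    if var == 0:
        # Constant baseline: any deviation at all is treated as anomalous.
        return mean(window) != mu or pvariance(window) > 0
    z = abs(mean(window) - mu) / var ** 0.5
    ratio = pvariance(window) / var
    return z > z_threshold or ratio > var_ratio_threshold

def large_sequence_gaps(seq_numbers, min_gap):
    """Return (previous, current) pairs of consecutively received
    sequence numbers between which at least min_gap messages are missing."""
    gaps = []
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        if cur - prev - 1 >= min_gap:
            gaps.append((prev, cur))
    return gaps

# Hypothetical latency samples recorded during normal operation.
baseline = [10, 11, 9, 10, 10, 11, 9, 10]
suspicious = is_anomalous(baseline, [20, 21, 19, 20])
bursts = large_sequence_gaps([1, 2, 3, 104, 105], min_gap=100)
```

The first check uses only statistics the node already keeps 
locally, while the second encodes one piece of expert 
knowledge: a single large gap in sequence numbers suggests a 
bursty loss rather than independent random drops. 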

5.3. Protocol overhead 

Regarding the protocol overhead, stages I and IV are 
performed locally with minimal computational effort and 
require no network bandwidth. Stages II and III are 
performed across the network. The GaBP algorithm typically 
converges in five iterations, where in each iteration each 
node sends around 2.5Kb of data to each of its neighbors. 
The GLS method requires only a single execution of the GaBP 
algorithm, whereas the Kalman filter requires several 
executions (depending on the required accuracy). We 
typically set the number of Kalman filter iterations to 10. 
The memory and CPU requirements were found to be negligible. 
The linear regression computation took around 0.1 seconds, 
while the Kalman filter step took up to 1 second. The 
computations are done by a separate thread, to avoid service 
interruption. 
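
As a rough worked example of these figures, the traffic 
sent to each neighbor can be estimated as below. The 
calculation assumes that every Kalman filter iteration 
triggers one full GaBP execution of five iterations, which 
is our reading of the numbers above and not a measured 
result; the unit is kept as Kb, as stated above, and the 
function name is hypothetical. 

```python
def per_neighbor_traffic_kb(gabp_iterations, kb_per_iteration, gabp_executions=1):
    """Estimated data sent by one node to one neighbor, in Kb.
    Assumes every GaBP execution runs the same number of iterations."""
    return gabp_iterations * kb_per_iteration * gabp_executions

# GLS method: a single GaBP execution of about five iterations.
gls_kb = per_neighbor_traffic_kb(5, 2.5)

# Kalman filter: assumed ten GaBP executions, one per filter iteration.
kalman_kb = per_neighbor_traffic_kb(5, 2.5, gabp_executions=10)
```

Under these assumptions the GLS method sends about 12.5Kb to 
each neighbor and the Kalman filter about 125Kb. 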

6. Conclusion and future work 

In this work we have proposed an efficient monitoring 
framework for use in distributed systems. In our approach, 
we regard the nodes' performance parameters as a stochastic 
process and use statistical signal processing tools to 
characterize the behavior of the process. Furthermore, we 
are able to perform root-cause analysis across the network 
to identify anomalous behaviors and the parameters which 
affect performance. 

Using a software prototype implemented in Java, which ran 
distributively on up to ten computing nodes, we 
demonstrated that we are able to correctly identify faulty 
nodes in a randomly generated data dissemination overlay, 
and in most cases to identify the reasons for performance 
degradation. However, in some difficult cases it was harder 
to distinguish between several possible faults. In such 
cases, we propose to use domain-specific knowledge locally 
to correctly identify the type of fault. 

As for future work, we plan to extend the local correction 
done at stage IV in order to create a self-healing network. 
Currently we focus on low memory situations, where the 
faulty node increases its allocated memory. We further plan 
to explore changes in local resource quotas such as 
bandwidth and process priority. Another interesting 
extension is to deploy a hierarchy in larger networks, where 
the monitoring is done locally in each domain and the 
performance results are aggregated using the domain 
hierarchy. 

The GaBP algorithm is an iterative algorithm which may 
not converge. We are now working on a variant of the 
algorithm which always converges to the correct answer. This 
variant will be reported in future work. 